HIHISIV: a database of gene expression in HIV and SIV host immune response

In the battle of the host against lentiviral pathogenesis, the immune response is crucial. However, several questions remain unanswered about the interaction with different viruses and their influence on disease progression. The simian immunodeficiency virus (SIV) infecting nonhuman primates (NHP) is widely used as a model for the study of the human immunodeficiency virus (HIV) both because they are evolutionarily linked and because they share physiological and anatomical similarities that are largely explored to understand the disease progression. The HIHISIV database was developed to support researchers to integrate and evaluate the large number of transcriptional data associated with the presence/absence of the pathogen (SIV or HIV) and the host response (NHP and human). The datasets are composed of microarray and RNA-Seq gene expression data that were selected, curated, analyzed, enriched, and stored in a relational database. Six query templates comprise the main data analysis functions and the resulting information can be downloaded. The HIHISIV database, available at https://hihisiv.github.io, provides accurate resources for browsing and visualizing results and for more robust analyses of pre-existing data in transcriptome repositories. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-024-05740-7.

hallmarks of disease in these (non-naturally) infected macaques is similar to the human immunodeficiency disease caused by HIV-1 [7,8].Importantly, epidemiological surveys in free-ranging chimpanzees showed that SIVcpz has a substantial negative impact on health, reproduction and lifespan of infected animals indicating natural SIV infections are not necessarily non-pathogenic.
One of the major challenges in the study of these viruses is understanding the complex interplay with the host's immune system, which tends to involve the depletion of CD4 + T lymphocytes.In the last few decades, advances in high-throughput technologies has enabled the analysis of large numbers of transcripts simultaneously and inferring subsets of these transcripts that are associated with the same biological condition [9].Although the amount of data is constantly increasing in transcriptome repositories, such as GEO (Gene Expression Omnibus), ArrayExpress, and SRA (Sequence Read Archive), the data are not processed, standardized, or integrated.
While several HIV databases offer an integrated approach, providing a comprehensive toolkit for exploring host genome dynamics, understanding latency mechanisms, and decoding the functional implications of differentially expressed genes during HIV infection [10][11][12], there remains a notable gap in databases that also organize and present SIV gene expression data.
To address this gap, we introduce HIHISIV, a database dedicated to the host immune response in the expression profiles of both SIV and HIV infections.Developed to support researchers in identifying molecular signatures, co-expressed genes, experimental design, and essential host factors.Moreover, it serves as a valuable framework for researchers in generating hypotheses and advancing our understanding of these infections.The database is composed of microarray and RNA-Seq gene expression data of viral infection by SIV and HIV hosts retrieved from the GEO repository.The datasets were manually curated and reanalyzed using standardized methods and the metadata and results were annotated and enriched with ontology terms.The HIHISIV database is currently in version 2.0 and includes 63 transcriptome experiments stored in a relational database.Furthermore, HIHISIV has a user-friendly web interface where co-expression networks can be visualized as graph and tabular data views and the dataset can be downloaded for further analysis.

Dataset selection
The initial datasets consist of microarray (1-color mRNA) and RNA-Seq (mRNA) gene expression data related to viral infections with SIV or HIV-1, retrieved from the GEO from the National Center for Biotechnology Information (NCBI).We compiled the initial database by selecting searches for 'SIV' or 'HIV-1' projects.Each dataset was manually examined to identify and separate the groups for contrast (reference versus test).Additionally, we established exclusion criteria for GEO projects or samples, and the following criteria were adopted for removing projects or samples: (i) Lack of sample information preventing the differentiation of groups for comparison (e.g., host type); (ii) Projects or samples related to antiretroviral therapy (ART), vaccination, monoclonal antibodies, or cell cultures; (iii) Insufficient number of samples (less than 3 per group).
It's important to note that each of these criteria serves as an exclusion factor, thus, any datasets meeting any of these conditions were excluded.
As a proof of concept to create the pipeline and compose the conceptual model, we selected one short-read RNA-Seq project (GEO id: GSE119234) and applied the same exclusion criteria.
During the curation process, we encountered various terminologies that refer to the same entity.To ensure standardization across different datasets and establish consistency, we have adopted the following nomenclature: • natural host: refers to natural host primates such as African green monkeys (Chlorocebus aethiops) and sooty mangabeys (Cercocebus atys); • non-natural host: refers to non-natural host primates such as rhesus macaque (Macaca mulatta); • non-progressor: refers to a human that has not progressed to immunodeficiency.
• uninfected (EFO_0001460): uninfected class is a disposition in which the bearer is not known to be affected by a disease within the context of a study.• acute infection (IDO:0000627): an infectious disorder that is the physical basis for an unfolding acute infectious disease course.• chronic infection (IDO:0000628): an infection that persists for an extended period of time.
In each project, the data were compared pairwise, and each comparison was named as an experiment.In some cases, these projects resulted in more than one experiment.For example, the project GSE7157 worked with non-natural species during three different phases of the infection (uninfected, acute and chronic infections).This project was divided into three different pairwise experiments (uninfected versus acute: GSE7157_d1; uninfected versus chronic: GSE7157_d2, and acute versus chronic: GSE7157_d3).The complete list of experiments is shown in the Experiments list navigation bar entry on the database web page and in the Additional file 1.

Metadata and ontologies
To enrich the experimental information with ontology terms, we employed the GEOquery library, for retrieving metadata (including phenodata) from GEO datasets.The extracted metadata consisted of semi-structured information, including essential details in the experiments such as title, summary, and overall design.We used SpaCy model, a natural language processing (NLP) tool, to identify automatically the key elements from these metadata [13].After that, the spurious or irrelevant terms were carefully filtered and the remaining terms were mapped to their corresponding biological ontologies using rols library [14] to access and query the Ontology Lookup Service (OLS) improving the interoperability and integration of the HIHISIV database with other sources.

Microarray and RNA-Seq data analysis
Due to the heterogeneity of experiments conducted by different research teams (assay platforms, virus strains, tissues etc.), the analyses were conducted separately through the pipeline delineated in a dataflow shown in Fig. 1.The list of experiments with the corresponding normalization matrix and phenodata is available in the HIHISIV database.
All microarray data (1-color mRNA) were downloaded from the Affymetrix platform [15].The datasets were normalized using the MAS5 normalization method implemented in the Affy R package (version 1.78.2) [16] resulting in a normalized matrix with the probes.For each experiment, a pair was compared.The first condition was considered as the reference and the second condition as the test, for instance, uninfected (reference) versus acute infection (test).
The detection of the differentially expressed genes (DEGs) was based on normalized datasets by the fitting of the gene-wise linear model (for each probe) followed by moderated t-tests implemented in the limma package in R (version 3.56.2) [17].We calculated the up and down-regulated genes between the sample pairs through the adjusted p-value (default for queries: false-discovery rate (FDR) ≤ 0.05) and the log fold-change measures (default for queries: |log-FC|> 1).We kept the complete results of the analysis in the database and probes mapped to more than one entrez_gene_ id were inserted as separate tuples in the database.Orthologous genes in Macaca mulatta and Homo sapiens were mapped using the bioRmat R library [18].
To establish a standardized pipeline, an RNA-Seq (mRNA) dataset was collected from the SRA repository (SRS from NextSeq 550 Homo sapiens) [19].Firstly, the data were transformed to Fastq using the SRA toolkit (version 2.10.5) and the quality of reads was checked using FastQC.RSEM rsem-calculate-expression (version 1.3) was used to align the transcriptome dataset in the genome reference (UCSC hg19) and after the mapping, the DEGs were evaluated using the library limma (limma-voom).The annotation was conducted by the R libraries TxDBHsapiens, HomoSapiens, and GenomicFeatures (version 1.52.1)[20].The gene identifiers ('ucsc_id') were kept.

Enrichment analysis, ingestion into database, and query design
We conducted enrichment analysis using the hypergeometric test implemented in the GOStats package [21] to identify relevant Gene Ontology (GO) terms associated with biological processes [22].The gene subset was defined by the set of differentially expressed (DE) genes obtained from pairwise comparisons between a test and a reference condition (threshold criteria: adj_p value ≤ 0.05 and |log-FC|> 1).To mitigate potential bias, as demonstrated by Timmons, Szkop and Gallagher [23], we applied a more restrictive cut-off (p-value < 0.001) to reduce detection bias.
The results of the analysis were ingested into the database (the technical description can be found in Additional file 2) and the single gene expression and co-expression information, presented visually as networks, are extracted by querying the database tables in SQL.

System architecture, implementation, and access
The HIHISIV database was designed and implemented as a multiple-component framework, as represented in Fig. 2 The static web page contains a description of how the database was built and a link to the source data that were used.The web page was written using the Jekyll framework.Finally, the Nginx proxy was used to route the requests from the Internet to the web app.For each of these components, a docker container was built and dockercompose was used to connect and deploy them.The system is hosted at the Data Extreme Laboratory (DEXL) of the National Laboratory for Scientific Computing (LNCC) in Brazil.

Support for the FAIR principles
The FAIR principles [24] are considered a gold standard for research data management.In this work, we follow best practices for improving the support for them in the HIHISIV database.For improving findability, basic metadata using schema.org[25] was added to the headers of the HIHISIV web page, allowing for indexing by search engines and dataset repositories.HIHISIV data and metadata are openly accessible through its URL (https:// hihis iv.github.io) using the HTTP protocol.When applicable, terms used both in the database schema and in its web interface follow Gene Ontology [22] and other ontologies such as EFO (The Experimental Factor Ontology), UBERON (Uber-anatomy ontology), PATO (the Phenotype And Trait Ontology), MONDO (Mondo Disease Ontology) and NCIT (The National Cancer Institute Thesaurus) improving interoperability.For better reusability, the database contains references to the original data used in the analyses.The source code for the analysis workflows, the database schema and initialization scripts, and the web application are available on GitHub (https:// github.com/ quelo pes/ HIHIS IV).This article describes v2.0 of the database, which is archived on Zenodo under 10.5281/zenodo.7093185.

Querying in the user interface
The current version of the database (2.0) comprises a total of 62 experiments sourced from the 14 GEO transcriptome repository.The database includes a total of 1,816 differentially expressed genes (DEGs), considering gene symbols.Additionally, transcript IDs were retained in the results and are displayed in the query interface, totaling 15,026 unique transcript IDs.The criteria for this count were set as adjusted p-value ≤ 0.05 and |log-FC|> 1.However, it is noteworthy that these thresholds were primarily applied to highlight significant genes found in the analysis.Nonetheless, all results were retained in the database to provide flexibility for adjustments by researchers.Concerning Gene Ontology (GO) Biological Process terms, we selected a total of 1249 unique terms when applying the thresholds outlined in the 'Enrichment analysis, ingestion into database, and query design' section.
HIHISIV can be queried through a web interface in which the user can configure and adjust the parameters of a query.There are six query interfaces with major analytic functions, which are 'Gene' , 'Transcripts' , 'Biological Process (GO)' , 'Ontology terms' , 'Single gene co-expression network' , and 'Gene set co-expression network' .The resulting information can be exported in a comma-separated values (CSV) file for further data manipulation.Public access and documentation are freely available through the database web page.Next, we provide a set of illustrative examples in a biological context to demonstrate how researchers can utilize HIHISIV to explore relevant datasets, identify key genes associated with immune responses, analyze co-expression networks, and elucidate relevant biological processes related to SIV and HIV-1 infections.

Gene
In this mode, the user can select the conditions under which a gene of interest shows differential expression.As an example, Fig. 3 presents the results for the gene USP18, (ubiquitin specific peptidase 18) that belongs to the ubiquitin-specific proteases (UBP).The USP18 gene and its protein product, USP18, are known to be involved in the innate immune response to viral infections, particularly in response to type I interferons, which are cytokines produced by the body during viral infections [12].The applied thresholds for differential expression analysis were adjusted p-value (adj_p value) ≤ 0.05 and |log-FC|> 1.The results show this gene was DE in seven experiments, for example in the comparison rhesus uninfected versus acute infection (GSE16147_d5, GSE17626_d1, GSE61766_d1), humans uninfected versus acute infection (GSE6740_d18) and human non-progressor versus acute infection (GSE6740_d21).

Transcript
This mode is similar to the previous one, but instead of querying by gene symbol or entrez_gene_id, the user may be interested in a specific probe or transcript id.In the example shown in Fig. 4, the probe id 'Mmu.STS.4748.a.s1_at' is selected based on the criteria of adj_p value ≤ 0.05 and |log-FC|> 1.As this probe id is specific to microarray platforms in M. mulatta (GPL3535-Affymetrix Rhesus Macaque Genome Array), the results show only experiments that used this platform.

Gene ontology (biological process)
This mode presents the results of the enrichment for GO terms related to BP for each comparison derived from the DEG analysis.By utilizing this mode, researchers can identify experiments that exhibit enrichment in specific BP terms of interest.For instance, in Fig. 5, the BP term 'cellular response to type II interferon' (GO:0071346) was found enriched in the experiments GSE6740_d1 (CD4 + versus CD8 + in acute infection), GSE6740_d2 (CD4 + versus CD8 + in chronic infection) and GSE13824_d1 (uninfected versus infected).Additionally, this mode provides information about the genes that were enriched within this specific ontology.To ensure statistical robustness, only the results with a significant p-value < 0.001 are displayed in the database.

Ontology terms
By integrating ontology terms with the experimental information, our approach provided a comprehensive understanding of the underlying biological contexts and enriched the dataset's metadata with valuable semantic annotations.In the 'Ontology term' mode the example in Fig. 6 shows the experiment associated with the term 'Encephalitis' (project GSE13824).

Single gene co-expression network
This query mode 'Single gene co-expression network' represents the genes that were coexpressed with a selected target gene.In the example shown in Fig. 7, we are using the same gene as in the 'Gene' mode, USP18 (adj_p value ≤ 0.05 and |log-FC|> 1).The result displays a set of genes that are co-expressed with USP18 in different experiments (e.g., IFIT1, PCLAF, IFI35, ISG15 and OAS2).The thickness of the connection lines between genes depends on the number of co-expressed experiments that the target gene has with Fig. 5 An example of a 'Gene Ontology' query shows the result for the biological process domain for the GO term 'cellular response to type II interferon' Fig. 6 'Ontology term' mode shows the project (GSE13824) associated with the term 'Encephalitis' the resulting gene.For instance, IFIT1 (interferon induced protein with tetratricopeptide repeats 1) is a gene that encodes a protein that may inhibit viral replication and translational initiation (provided by RefSeq, Aug 2012).This gene is co-expressed with the USP18 gene in seven experiments, including GSE16147_d5, GSE13824_d1, and GSE6740_d22.The query result can be visualized as a graph representing the gene coexpression network as well as a tabular format.

Gene set co-expression network
In this query mode, instead of single gene co-expression, a set of genes can be entered and the result displays a visual network representation.The associated table facilitates a comprehensive understanding of the co-expression patterns and experimental associations among the genes of interest.For instance, in the example shown in Fig. 8, we used the gene mentioned in the previous examples, USP18, along with other genes such as STAT1, SP100, IFI35, APOBEC3A, MX1, and CXCR5 (adj_p value ≤ 0.05 and |log-FC|> 1).The query result reveals an interconnected network of genes associated with the immune response, particularly in defense against pathogens such as viruses.Additionally, the result is presented as a table indicating the experiments associated with each pair of connections.The user can also choose if disconnected nodes, i.e., genes that are not co-expressed with other genes, are also displayed in the network representation.

Conclusions
We present the HIHISIV database that provides a comprehensive integrated view of immune host response in SIV and HIV hosts.The data made available through HIHISIV, following best practices for supporting the FAIR principles, is based on aggregation of metadata and a workflow for analyzing microarray and RNA-Seq datasets and annotations.The workflow identifies differentially expressed genes in the different studies analyzed and adds other types of interactions and relevant roles that these genes have.The HIHISIV database contains a web page with an easy-to-use interface for biologists to search and browse for genes and experimentally testable new hypotheses of molecular mechanisms related to the infection process in HIV/SIV and host types.The database also has additional information about viruses, documentation, experiments list, and external sources.Our objective is to continue the development of the HIHISIV around the querying, metadata, analysis functionality, and addition of new datasets from main transcriptome repositories.

Fig. 1
Fig. 1 Dataflow used to generate data for HIHISIV showing metadata and raw data input, their RNA-Seq, microarray, and enrichment analyses, the metadata extraction of relevant entities, and insertion of all results in the database . The database was implemented and instantiated in a PostgreSQL (version 15.3) server following the same conceptual model described previously (Additional file 2).The web application provides a user-friendly interface implemented in the Streamlit (version 1.25.0)Python library with predefined parameterizable queries to be executed in the database.It enables the analysis of gene expression levels in HIV-1 and SIV hosts.The results are displayed in tables or networks, and all the data can be downloaded by the users.Other Python libraries used in the implementation include Psycopg2 (version 2.9.5), for connecting to the PostgreSQL database and executing queries; Pandas (version 1.5.3) for manipulating data frames; and NetworkX (version 3.1) for building networks.

Fig. 7
Fig. 7 Example of 'Single gene co-expressed network' query for the gene USP18.The connections in this gene represent the genes co-expressed according to the parameters (adj_p value ≤ 0.05 and |log-FC|> 1) and the thickness of the connection varies according to the number of co-expressed experiments that the target gene has with the resulting gene